1. Introduction
Students: Berke ASLAN and William BOZZACCHI
Document content and description
This document is the final deliverable of the ESILV Year 5 Credit Scoring Project.
The project was carried out in the R programming language, and this document was generated with R Markdown.
2. Libraries and datasets
2.1 Libraries
options(warn=-1)
library(tidyverse)
library(dplyr)
library(knitr)
library(skimr)
library(lubridate)
library(gridExtra)
library(scorecard)
library(ggplot2)
library(ggplotify)
library(plotly)
library(questionr)
library(caret)
library(pROC)
library(purrr)
library(caTools)
2.2 Datasets
Now let’s import the datasets and check their formats.
# Import dataset from the project directory
raw_data <- read_csv("TD4Data.csv")
data_dict <- read_csv("data_dictionary.csv")
# Split dataset into train/test
set.seed(101)
train <- raw_data %>% sample_frac(.8)
test <- anti_join(raw_data, train, by='UniqueID')
# Summarize the datasets
summary(train)
## UniqueID disbursed_amount asset_cost ltv
## Min. :417614 Min. : 15579 Min. : 39684 Min. :25.72
## 1st Qu.:499762 1st Qu.: 48182 1st Qu.: 66279 1st Qu.:69.83
## Median :569888 Median : 54513 Median : 70802 Median :77.42
## Mean :554380 Mean : 55468 Mean : 76608 Mean :75.43
## 3rd Qu.:610596 3rd Qu.: 60959 3rd Qu.: 78905 3rd Qu.:84.38
## Max. :653991 Max. :153318 Max. :215528 Max. :94.95
## NA's :1 NA's :1 NA's :1
## branch_id supplier_id manufacturer_id Current_pincode_ID
## Min. : 1.00 Min. :12797 Min. : 45.00 Min. : 5
## 1st Qu.: 13.00 1st Qu.:16264 1st Qu.: 48.00 1st Qu.:1484
## Median : 61.00 Median :18709 Median : 86.00 Median :2783
## Mean : 70.17 Mean :19499 Mean : 69.25 Mean :3216
## 3rd Qu.:120.00 3rd Qu.:22991 3rd Qu.: 86.00 3rd Qu.:5232
## Max. :260.00 Max. :24793 Max. :120.00 Max. :7302
## NA's :1 NA's :1
## Date.of.Birth Employment.Type DisbursalDate State_ID
## Length:1311 Length:1311 Length:1311 Min. : 1.000
## Class :character Class :character Class :character 1st Qu.: 4.000
## Mode :character Mode :character Mode :character Median : 6.000
## Mean : 6.905
## 3rd Qu.: 9.000
## Max. :21.000
##
## Employee_code_ID MobileNo_Avl_Flag Aadhar_flag PAN_flag
## Min. : 3.0 Min. :1 Min. :0.0000 Min. :0.00000
## 1st Qu.: 716.2 1st Qu.:1 1st Qu.:1.0000 1st Qu.:0.00000
## Median :1402.5 Median :1 Median :1.0000 Median :0.00000
## Mean :1536.6 Mean :1 Mean :0.8375 Mean :0.07933
## 3rd Qu.:2286.5 3rd Qu.:1 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :3774.0 Max. :1 Max. :1.0000 Max. :1.00000
## NA's :1
## VoterID_flag Driving_flag Passport_flag PERFORM_CNS.SCORE
## Min. :0.0000 Min. :0.00000 Min. :0.000000 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.: 0.0
## Median :0.0000 Median :0.00000 Median :0.000000 Median :563.0
## Mean :0.1426 Mean :0.02822 Mean :0.003051 Mean :415.5
## 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:737.0
## Max. :1.0000 Max. :1.00000 Max. :1.000000 Max. :853.0
##
## PERFORM_CNS.SCORE.DESCRIPTION PRI.NO.OF.ACCTS PRI.ACTIVE.ACCTS
## Length:1311 Min. : 0.000 Min. : 0.00
## Class :character 1st Qu.: 0.000 1st Qu.: 0.00
## Mode :character Median : 2.000 Median : 1.00
## Mean : 3.577 Mean : 1.57
## 3rd Qu.: 4.000 3rd Qu.: 2.00
## Max. :104.000 Max. :24.00
##
## PRI.OVERDUE.ACCTS PRI.CURRENT.BALANCE PRI.SANCTIONED.AMOUNT
## Min. :0.0000 Min. : -3532 Min. : 0
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.: 0
## Median :0.0000 Median : 8164 Median : 23820
## Mean :0.2159 Mean : 217109 Mean : 270625
## 3rd Qu.:0.0000 3rd Qu.: 93738 3rd Qu.: 150000
## Max. :8.0000 Max. :10091566 Max. :10000000
## NA's :1 NA's :1
## PRI.DISBURSED.AMOUNT PRIMARY.INSTAL.AMT NEW.ACCTS.IN.LAST.SIX.MONTHS
## Min. : 0 Min. : 0 Min. : 0.0000
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0.0000
## Median : 23056 Median : 0 Median : 0.0000
## Mean : 271799 Mean : 14506 Mean : 0.5728
## 3rd Qu.: 150000 3rd Qu.: 4750 3rd Qu.: 1.0000
## Max. :10000000 Max. :4189968 Max. :13.0000
## NA's :1
## DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS AVERAGE.ACCT.AGE CREDIT.HISTORY.LENGTH
## Min. :0.0000 Length:1311 Length:1311
## 1st Qu.:0.0000 Class :character Class :character
## Median :0.0000 Mode :character Mode :character
## Mean :0.1373
## 3rd Qu.:0.0000
## Max. :7.0000
##
## NO.OF_INQUIRIES loan_default
## Min. : 0.0000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.:0.0000
## Median : 0.0000 Median :0.0000
## Mean : 0.2906 Mean :0.2586
## 3rd Qu.: 0.0000 3rd Qu.:1.0000
## Max. :13.0000 Max. :1.0000
##
summary(data_dict)
## Variable Name Description
## Length:41 Length:41
## Class :character Class :character
## Mode :character Mode :character
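Note that the 80/20 split above samples rows uniformly at random, so the default rate can drift slightly between train and test. A stratified split keeps the class balance identical in both parts; the sketch below is base R on toy data (the `stratified_split` helper is ours, not part of the project).

```r
# Stratified 80/20 split: sample 80% of the rows within each class of `y`,
# so train and test keep the same default rate.
stratified_split <- function(df, y, frac = 0.8, seed = 101) {
  set.seed(seed)
  idx <- unlist(lapply(split(seq_len(nrow(df)), df[[y]]), function(rows) {
    sample(rows, size = floor(frac * length(rows)))
  }))
  list(train = df[idx, ], test = df[-idx, ])
}

# Toy illustration with a 25% default rate, close to the project data
toy <- data.frame(UniqueID = 1:1000, loan_default = rep(c(0, 0, 0, 1), 250))
parts <- stratified_split(toy, "loan_default")
mean(parts$train$loan_default)  # exactly 0.25 here
```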
3. Snorkeling
3.0.1 Loading the files and first impressions
Let’s use the skim function from the skimr library to get an overview of our dataset.
skim(train)
| Name | train |
| Number of rows | 1311 |
| Number of columns | 34 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 28 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Date.of.Birth | 0 | 1.00 | 10 | 10 | 0 | 996 | 0 |
| Employment.Type | 37 | 0.97 | 8 | 13 | 0 | 2 | 0 |
| DisbursalDate | 0 | 1.00 | 10 | 10 | 0 | 83 | 0 |
| PERFORM_CNS.SCORE.DESCRIPTION | 0 | 1.00 | 10 | 55 | 0 | 19 | 0 |
| AVERAGE.ACCT.AGE | 0 | 1.00 | 9 | 11 | 0 | 82 | 0 |
| CREDIT.HISTORY.LENGTH | 0 | 1.00 | 9 | 11 | 0 | 130 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| UniqueID | 1 | 1 | 554379.60 | 68099.09 | 417614.00 | 499762.00 | 569888.00 | 610596.50 | 653991.00 | ▅▃▅▇▇ |
| disbursed_amount | 1 | 1 | 55467.52 | 13854.46 | 15579.00 | 48182.00 | 54513.00 | 60959.00 | 153318.00 | ▂▇▁▁▁ |
| asset_cost | 1 | 1 | 76608.23 | 20066.28 | 39684.00 | 66279.25 | 70802.00 | 78905.25 | 215528.00 | ▇▃▁▁▁ |
| ltv | 0 | 1 | 75.43 | 11.61 | 25.72 | 69.83 | 77.42 | 84.38 | 94.95 | ▁▁▂▇▆ |
| branch_id | 0 | 1 | 70.17 | 68.48 | 1.00 | 13.00 | 61.00 | 120.00 | 260.00 | ▇▅▃▁▁ |
| supplier_id | 1 | 1 | 19499.23 | 3541.37 | 12797.00 | 16264.00 | 18709.00 | 22991.25 | 24793.00 | ▃▅▃▃▇ |
| manufacturer_id | 0 | 1 | 69.25 | 21.78 | 45.00 | 48.00 | 86.00 | 86.00 | 120.00 | ▇▁▇▁▁ |
| Current_pincode_ID | 1 | 1 | 3216.27 | 2163.62 | 5.00 | 1484.00 | 2783.00 | 5232.50 | 7302.00 | ▇▇▅▃▆ |
| State_ID | 0 | 1 | 6.91 | 4.35 | 1.00 | 4.00 | 6.00 | 9.00 | 21.00 | ▇▅▂▁▁ |
| Employee_code_ID | 1 | 1 | 1536.63 | 978.28 | 3.00 | 716.25 | 1402.50 | 2286.50 | 3774.00 | ▇▇▆▅▃ |
| MobileNo_Avl_Flag | 0 | 1 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
| Aadhar_flag | 0 | 1 | 0.84 | 0.37 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▂▁▁▁▇ |
| PAN_flag | 0 | 1 | 0.08 | 0.27 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| VoterID_flag | 0 | 1 | 0.14 | 0.35 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| Driving_flag | 0 | 1 | 0.03 | 0.17 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| Passport_flag | 0 | 1 | 0.00 | 0.06 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| PERFORM_CNS.SCORE | 0 | 1 | 415.52 | 335.78 | 0.00 | 0.00 | 563.00 | 737.00 | 853.00 | ▇▁▁▅▇ |
| PRI.NO.OF.ACCTS | 0 | 1 | 3.58 | 5.92 | 0.00 | 0.00 | 2.00 | 4.00 | 104.00 | ▇▁▁▁▁ |
| PRI.ACTIVE.ACCTS | 0 | 1 | 1.57 | 2.33 | 0.00 | 0.00 | 1.00 | 2.00 | 24.00 | ▇▁▁▁▁ |
| PRI.OVERDUE.ACCTS | 0 | 1 | 0.22 | 0.61 | 0.00 | 0.00 | 0.00 | 0.00 | 8.00 | ▇▁▁▁▁ |
| PRI.CURRENT.BALANCE | 1 | 1 | 217109.47 | 728115.00 | -3532.00 | 0.00 | 8164.50 | 93737.50 | 10091566.00 | ▇▁▁▁▁ |
| PRI.SANCTIONED.AMOUNT | 1 | 1 | 270625.16 | 788387.60 | 0.00 | 0.00 | 23819.50 | 150000.00 | 10000000.00 | ▇▁▁▁▁ |
| PRI.DISBURSED.AMOUNT | 1 | 1 | 271799.44 | 809182.76 | 0.00 | 0.00 | 23055.50 | 150000.00 | 10000000.00 | ▇▁▁▁▁ |
| PRIMARY.INSTAL.AMT | 0 | 1 | 14506.10 | 144638.06 | 0.00 | 0.00 | 0.00 | 4750.00 | 4189968.00 | ▇▁▁▁▁ |
| NEW.ACCTS.IN.LAST.SIX.MONTHS | 0 | 1 | 0.57 | 1.14 | 0.00 | 0.00 | 0.00 | 1.00 | 13.00 | ▇▁▁▁▁ |
| DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | 0 | 1 | 0.14 | 0.45 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
| NO.OF_INQUIRIES | 0 | 1 | 0.29 | 0.90 | 0.00 | 0.00 | 0.00 | 0.00 | 13.00 | ▇▁▁▁▁ |
| loan_default | 0 | 1 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
Let’s pick a few variables to inspect the structure of our dataset.
options(tibble.width = Inf) # displays all columns.
# Check specific data structures
train %>% dplyr::select(ltv,
disbursed_amount,
"Date.of.Birth",
supplier_id,
"Employment.Type",
"PERFORM_CNS.SCORE.DESCRIPTION",
loan_default,
"AVERAGE.ACCT.AGE",
"CREDIT.HISTORY.LENGTH") %>% head()
## # A tibble: 6 × 9
## ltv disbursed_amount Date.of.Birth supplier_id Employment.Type
## <dbl> <dbl> <chr> <dbl> <chr>
## 1 77.2 51578 25/06/1982 15899 Salaried
## 2 60.4 36975 20/08/1966 17916 Self employed
## 3 89.4 64482 25/10/1995 22708 Salaried
## 4 69.6 116908 01/06/1981 23002 Self employed
## 5 59.9 41381 23/07/1985 23483 Self employed
## 6 65.0 72717 09/11/1988 23150 Self employed
## PERFORM_CNS.SCORE.DESCRIPTION loan_default AVERAGE.ACCT.AGE
## <chr> <dbl> <chr>
## 1 B-Very Low Risk 0 4yrs 7mon
## 2 H-Medium Risk 0 1yrs 3mon
## 3 A-Very Low Risk 0 1yrs 1mon
## 4 Not Scored: Only a Guarantor 0 3yrs 10mon
## 5 C-Very Low Risk 1 0yrs 11mon
## 6 Not Scored: Sufficient History Not Available 1 0yrs 0mon
## CREDIT.HISTORY.LENGTH
## <chr>
## 1 4yrs 7mon
## 2 3yrs 0mon
## 3 2yrs 6mon
## 4 3yrs 10mon
## 5 1yrs 8mon
## 6 0yrs 0mon
3.0.2 Discussion with the business side
After this first snorkeling phase, we have seen the structure of the dataset and its variables. This first step gives us an initial idea of which variables are relevant. We also retyped the categorical variables as factors.
4. Deep Dive
4.1 Wrangling and formatting
In this phase, we decided to modify the structure of the PERFORM_CNS.SCORE.DESCRIPTION variable. This variable describes the bureau score and has 19 distinct levels. We regrouped them into four broader levels to make the logistic regression computation more efficient. The four new levels are: High, Low, Medium and Not Scored.
We then transformed the AVERAGE.ACCT.AGE and CREDIT.HISTORY.LENGTH variables, which represent respectively the average age of the loans and the time elapsed since the first loan. We converted them to a number of months instead of the "Xyrs Ymon" format, which gives us numeric variables whose impact we can study.
We did the same for the Date.of.Birth and DisbursalDate variables, converting them respectively to BorrowerAge (the borrower's age) and NbrMonthRelation (the number of months since disbursal) instead of the Day/Month/Year format.
By plotting the variables, we noticed that some of them are irrelevant because they carry no information about the individual. This is the case for Passport_flag and MobileNo_Avl_Flag: MobileNo_Avl_Flag takes the same value for every individual and Passport_flag is almost constant, which makes them uninformative.
# Factor the drivers
wrangled_train <- train %>% mutate(
  Employment.Type = replace_na(Employment.Type, "None"),
  State_ID = as.factor(State_ID),
  loan_default = as.factor(loan_default),
  VoterID_flag = as.factor(VoterID_flag),
  supplier_id = as.factor(supplier_id),
  manufacturer_id = as.factor(manufacturer_id),
  Current_pincode_ID = as.factor(Current_pincode_ID),
  MobileNo_Avl_Flag = as.factor(MobileNo_Avl_Flag),
  Driving_flag = as.factor(Driving_flag),
  Passport_flag = as.factor(Passport_flag)
)
# Factor the drivers
wrangled_test <- test %>% mutate(
  Employment.Type = replace_na(Employment.Type, "None"),
  State_ID = as.factor(State_ID),
  loan_default = as.factor(loan_default),
  VoterID_flag = as.factor(VoterID_flag),
  supplier_id = as.factor(supplier_id),
  manufacturer_id = as.factor(manufacturer_id),
  Current_pincode_ID = as.factor(Current_pincode_ID),
  MobileNo_Avl_Flag = as.factor(MobileNo_Avl_Flag),
  Driving_flag = as.factor(Driving_flag),
  Passport_flag = as.factor(Passport_flag)
)
We compute the row percentages with the lprop function:
# Check data occurences
wrangled_train %>% count(PERFORM_CNS.SCORE.DESCRIPTION)
## # A tibble: 19 × 2
## PERFORM_CNS.SCORE.DESCRIPTION n
## <chr> <int>
## 1 A-Very Low Risk 123
## 2 B-Very Low Risk 71
## 3 C-Very Low Risk 138
## 4 D-Very Low Risk 85
## 5 E-Low Risk 38
## 6 F-Low Risk 67
## 7 G-Low Risk 30
## 8 H-Medium Risk 61
## 9 I-Medium Risk 101
## 10 J-High Risk 36
## 11 K-High Risk 67
## 12 L-Very High Risk 11
## 13 M-Very High Risk 68
## 14 No Bureau History Available 305
## 15 Not Scored: No Activity seen on the customer (Inactive) 34
## 16 Not Scored: No Updates available in last 36 months 9
## 17 Not Scored: Not Enough Info available on the customer 34
## 18 Not Scored: Only a Guarantor 10
## 19 Not Scored: Sufficient History Not Available 23
wrangled_train <- wrangled_train %>%
mutate(credit_category = case_when(
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "M-") ~ "High",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "F-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "D-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "K-") ~ "High",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "B-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "G-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "I-") ~ "Medium",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "Not Scored") ~ "Not Scored",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "A-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "C-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "No Bureau") ~ "Not Scored",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "E-") ~ "Low",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "H-") ~ "Medium",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "J-") ~ "High",
startsWith(PERFORM_CNS.SCORE.DESCRIPTION, "L-") ~ "High",
))
# Utilize lprop function to calculate percentages
tab <- table(wrangled_train$PERFORM_CNS.SCORE.DESCRIPTION,wrangled_train$loan_default)
lprop(tab)
##
## 0 1 Total
## A-Very Low Risk 90.2 9.8 100.0
## B-Very Low Risk 90.1 9.9 100.0
## C-Very Low Risk 82.6 17.4 100.0
## D-Very Low Risk 77.6 22.4 100.0
## E-Low Risk 92.1 7.9 100.0
## F-Low Risk 79.1 20.9 100.0
## G-Low Risk 86.7 13.3 100.0
## H-Medium Risk 63.9 36.1 100.0
## I-Medium Risk 72.3 27.7 100.0
## J-High Risk 50.0 50.0 100.0
## K-High Risk 46.3 53.7 100.0
## L-Very High Risk 27.3 72.7 100.0
## M-Very High Risk 39.7 60.3 100.0
## No Bureau History Available 76.1 23.9 100.0
## Not Scored: No Activity seen on the customer (Inactive) 76.5 23.5 100.0
## Not Scored: No Updates available in last 36 months 88.9 11.1 100.0
## Not Scored: Not Enough Info available on the customer 76.5 23.5 100.0
## Not Scored: Only a Guarantor 50.0 50.0 100.0
## Not Scored: Sufficient History Not Available 65.2 34.8 100.0
## All 74.1 25.9 100.0
# Utilize lprop function to calculate percentages
tab_prime <- table(wrangled_train$credit_category,wrangled_train$loan_default)
lprop(tab_prime)
##
## 0 1 Total
## High 43.4 56.6 100.0
## Low 85.0 15.0 100.0
## Medium 69.1 30.9 100.0
## Not Scored 75.2 24.8 100.0
## All 74.1 25.9 100.0
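The same row percentages can also be computed without questionr, since base R's prop.table is the equivalent of lprop; a small sketch on toy data:

```r
# Toy data: a categorical driver and a 0/1 default flag
category <- c("Low", "Low", "Low", "High", "High")
loan_default <- c(0, 0, 1, 1, 1)

# Row percentages: the base-R equivalent of questionr::lprop
tab <- table(category, loan_default)
round(100 * prop.table(tab, margin = 1), 1)
```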
Parsing with regular expressions.
# Calculate yrs mon parse to months
wrangled_train <- wrangled_train %>%
mutate(AAA = 12 * as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=mon)")),
CHL = 12 * as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=mon)")))
wrangled_train %>% select(AVERAGE.ACCT.AGE,
AAA,
CREDIT.HISTORY.LENGTH,
CHL)
## # A tibble: 1,311 × 4
## AVERAGE.ACCT.AGE AAA CREDIT.HISTORY.LENGTH CHL
## <chr> <dbl> <chr> <dbl>
## 1 4yrs 7mon 55 4yrs 7mon 55
## 2 1yrs 3mon 15 3yrs 0mon 36
## 3 1yrs 1mon 13 2yrs 6mon 30
## 4 3yrs 10mon 46 3yrs 10mon 46
## 5 0yrs 11mon 11 1yrs 8mon 20
## 6 0yrs 0mon 0 0yrs 0mon 0
## 7 0yrs 0mon 0 0yrs 0mon 0
## 8 0yrs 0mon 0 0yrs 2mon 2
## 9 2yrs 3mon 27 5yrs 9mon 69
## 10 0yrs 0mon 0 0yrs 0mon 0
## # … with 1,301 more rows
# Calculate yrs mon parse to months
wrangled_test <- wrangled_test %>%
mutate(AAA = 12 * as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(AVERAGE.ACCT.AGE, pattern="(\\d)+(?=mon)")),
CHL = 12 * as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=yrs)")) + as.numeric(str_extract(CREDIT.HISTORY.LENGTH, pattern="(\\d)+(?=mon)")))
We now compute the borrower's age and the number of months elapsed since the disbursal date.
# Insert function to calculate age
# Transform character columns to date
wrangled_train$Date.of.Birth <- dmy(wrangled_train$Date.of.Birth)
wrangled_train$DisbursalDate <- dmy(wrangled_train$DisbursalDate)
# diff_in_days = difftime(datetimes[2], datetimes[1], units = "days")
wrangled_train <- wrangled_train %>% mutate(
BorrowerAge = MESS::age(Date.of.Birth, today()),
NbrMonthRelation = -1 * as.numeric(difftime(DisbursalDate, today(), units = "days") / 30)
)
wrangled_train %>% select(Date.of.Birth,
BorrowerAge,
DisbursalDate,
NbrMonthRelation)
## # A tibble: 1,311 × 4
## Date.of.Birth BorrowerAge DisbursalDate NbrMonthRelation
## <date> <dbl> <date> <dbl>
## 1 1982-06-25 39 2018-10-13 38.5
## 2 1966-08-20 55 2018-10-26 38.1
## 3 1995-10-25 26 2018-10-09 38.6
## 4 1981-06-01 40 2018-10-27 38.0
## 5 1985-07-23 36 2018-09-28 39
## 6 1988-11-09 33 2018-09-18 39.3
## 7 1997-01-01 24 2018-10-27 38.0
## 8 1960-04-15 61 2018-10-11 38.6
## 9 1983-12-22 37 2018-10-30 37.9
## 10 1976-01-01 45 2018-10-10 38.6
## # … with 1,301 more rows
# Insert function to calculate age
# Transform character columns to date
wrangled_test$Date.of.Birth <- dmy(wrangled_test$Date.of.Birth)
wrangled_test$DisbursalDate <- dmy(wrangled_test$DisbursalDate)
# diff_in_days = difftime(datetimes[2], datetimes[1], units = "days")
wrangled_test <- wrangled_test %>% mutate(
BorrowerAge = MESS::age(Date.of.Birth, today()),
NbrMonthRelation = -1 * as.numeric(difftime(DisbursalDate, today(), units = "days") / 30)
)
library(gridExtra)
# Grid plot ggplot
p1 <- ggplot(wrangled_train, aes(x=AAA)) + geom_histogram()
p2 <- ggplot(wrangled_train, aes(x=Aadhar_flag)) + geom_histogram()
p3 <- ggplot(wrangled_train, aes(x=asset_cost)) + geom_histogram()
p4 <- ggplot(wrangled_train, aes(x=BorrowerAge)) + geom_histogram()
p5 <- ggplot(wrangled_train, aes(x=branch_id)) + geom_histogram()
p6 <- ggplot(wrangled_train, aes(x=CHL)) + geom_histogram()
p7 <- ggplot(wrangled_train, aes(x=DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS)) + geom_histogram()
p8 <- ggplot(wrangled_train, aes(x=disbursed_amount)) + geom_histogram()
p9 <- ggplot(wrangled_train, aes(x=Employee_code_ID)) + geom_histogram()
p10 <- ggplot(wrangled_train, aes(x=ltv)) + geom_histogram()
p11 <- ggplot(wrangled_train, aes(x=NbrMonthRelation)) + geom_histogram()
p12 <- ggplot(wrangled_train, aes(x=NEW.ACCTS.IN.LAST.SIX.MONTHS)) + geom_histogram()
p13 <- ggplot(wrangled_train, aes(x=NO.OF_INQUIRIES)) + geom_histogram()
p14 <- ggplot(wrangled_train, aes(x=PAN_flag)) + geom_histogram()
p15 <- ggplot(wrangled_train, aes(x=PERFORM_CNS.SCORE)) + geom_histogram()
p16 <- ggplot(wrangled_train, aes(x=PRI.ACTIVE.ACCTS)) + geom_histogram()
p17 <- ggplot(wrangled_train, aes(x=PRI.CURRENT.BALANCE)) + geom_histogram()
p18 <- ggplot(wrangled_train, aes(x=PRI.DISBURSED.AMOUNT)) + geom_histogram()
p19 <- ggplot(wrangled_train, aes(x=PRI.NO.OF.ACCTS)) + geom_histogram()
p20 <- ggplot(wrangled_train, aes(x=PRI.OVERDUE.ACCTS)) + geom_histogram()
p21 <- ggplot(wrangled_train, aes(x=PRI.SANCTIONED.AMOUNT)) + geom_histogram()
p22 <- ggplot(wrangled_train, aes(x=PRIMARY.INSTAL.AMT)) + geom_histogram()
p23 <- ggplot(wrangled_train, aes(x=UniqueID)) + geom_histogram()
grid.arrange(p1, p2, p3, p4,
p5, p6, p7, p8,
p9, p10, p11, p12,
p13, p14, p15, p16,
p17, p18, p19, p20,
p21, p22, p23, ncol=3, nrow = 8)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## (message repeated for each of the 23 histogram panels)
Categorical variables.
p1 <- ggplot(wrangled_train, aes(x=Current_pincode_ID)) + geom_bar()
p2 <- ggplot(wrangled_train, aes(x=Driving_flag)) + geom_bar()
p3 <- ggplot(wrangled_train, aes(x=loan_default)) + geom_bar()
p4 <- ggplot(wrangled_train, aes(x=manufacturer_id)) + geom_bar()
p5 <- ggplot(wrangled_train, aes(x=MobileNo_Avl_Flag)) + geom_bar()
p6 <- ggplot(wrangled_train, aes(x=Passport_flag)) + geom_bar()
p7 <- ggplot(wrangled_train, aes(x=State_ID)) + geom_bar()
p8 <- ggplot(wrangled_train, aes(x=supplier_id)) + geom_bar()
p9 <- ggplot(wrangled_train, aes(x=VoterID_flag)) + geom_bar()
grid.arrange(p1, p2, p3, p4,
p5, p6, p7, p8,
p9, ncol=3, nrow = 3)
4.3 Variable selection for the model
4.3.1 Univariate & bivariate analyses
The boxplot of the BorrowerAge variable reveals a data-entry error: it shows an individual aged 135, which is impossible. We also observe an outlier, a defaulting individual aged 92.
boxplot(wrangled_train$BorrowerAge, main = "Borrower age",
        ylab = "Age (years)", horizontal = FALSE)
Boxplot and violin plot.
p1 <- ggplot(wrangled_train, aes(x = BorrowerAge)) + geom_boxplot() + xlab("BorrowerAge")
p2 <- ggplot(wrangled_train, aes(x = BorrowerAge, y = loan_default)) + geom_boxplot() + xlab("BorrowerAge") + ylab("loan_default")
p3 <- ggplot(wrangled_train, aes(x = BorrowerAge, y = loan_default)) + geom_violin() + geom_boxplot(width = .1) + xlab("BorrowerAge") + ylab("loan_default")
grid.arrange(p1, p2, p3, ncol=2, nrow = 2)
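Given the impossible 135-year-old record spotted above, a simple cleanup is to drop implausible ages before modelling. A minimal sketch, assuming a plausible range of 18 to 100 years (the bounds are our assumption, not a project rule):

```r
# Drop borrower ages outside a plausible range; the 18-100 bounds are an
# assumption for illustration, not a business rule from the project.
clean_ages <- function(ages, lo = 18, hi = 100) {
  ages[ages >= lo & ages <= hi]
}

ages <- c(24, 36, 55, 92, 135)  # 135 is the impossible data-entry error
clean_ages(ages)                # drops 135, keeps the genuine 92-year-old outlier
```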
4.3.2 Discretizing the variables
In this section we discretize the variables in order to fit our logistic regression. This step loses some information, but it is necessary because too many distinct values makes the computation of the logistic regression intractable.
The plots give us relevant information about the impact of each variable in our model.
For example, the plot of the ltv variable (loan-to-asset-value ratio) shows that default increases with it. The ltv variable was discretized into three classes: [0; 55], [55; 85] and [85; inf]. The blue percentage on the curve is the proportion of bad (red, bottom) observations over the whole population of the bin. We observe that the higher the ltv, i.e. the larger the loan relative to the asset value, the more likely the individual is to default.
The plot of PERFORM_CNS.SCORE.DESCRIPTION also shows a variable that increases with default: the proportion of bad individuals in the whole population rises with the bureau rating (we observe 56% for the High score, meaning that more than one individual in two defaults in that category).
to_plot_cat <- c("ltv", "disbursed_amount", "PRI.CURRENT.BALANCE",
"PRI.DISBURSED.AMOUNT", "PRIMARY.INSTAL.AMT", "BorrowerAge",
"AAA", "PERFORM_CNS.SCORE.DESCRIPTION", "PRI.OVERDUE.ACCTS",
"DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS")
for (i in to_plot_cat) {
  bins <- woebin(wrangled_train, y = "loan_default", x = i, bin_num_limit = 3)
  p1 <- woebin_plot(bins)
  print(p1)
}
## [INFO] creating woe binning ...
## $ltv
##
## [INFO] creating woe binning ...
## $disbursed_amount
##
## [INFO] creating woe binning ...
## $PRI.CURRENT.BALANCE
##
## [INFO] creating woe binning ...
## $PRI.DISBURSED.AMOUNT
##
## [INFO] creating woe binning ...
## $PRIMARY.INSTAL.AMT
##
## [INFO] creating woe binning ...
## $BorrowerAge
##
## [INFO] creating woe binning ...
## $AAA
##
## [INFO] creating woe binning ...
## $PERFORM_CNS.SCORE.DESCRIPTION
##
## [INFO] creating woe binning ...
## $PRI.OVERDUE.ACCTS
##
## [INFO] creating woe binning ...
## $DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS
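The manual [0; 55], [55; 85], [85; inf] discretization of ltv described above can be reproduced directly with base R's cut; the same breaks could also be passed to scorecard::woebin through its breaks_list argument instead of relying on the automatic binning. A sketch on a few ltv values taken from the data:

```r
# Discretize ltv into the three classes described in the text
ltv <- c(25.72, 60.4, 77.2, 85.0, 94.95)
ltv_bin <- cut(ltv, breaks = c(0, 55, 85, Inf), right = FALSE,
               labels = c("[0;55)", "[55;85)", "[85;Inf)"))
table(ltv_bin)  # 1 observation in the first class, 2 in each of the others
```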
4.3.3 Discriminatory power of the variables
iv = iv(wrangled_train, y = 'loan_default') %>%
as_tibble() %>%
mutate( info_value = round(info_value, 3) ) %>%
arrange( desc(info_value) )
iv %>%
knitr::kable()
| variable | info_value |
|---|---|
| PERFORM_CNS.SCORE | 0.573 |
| PERFORM_CNS.SCORE.DESCRIPTION | 0.564 |
| credit_category | 0.464 |
| branch_id | 0.374 |
| PRI.DISBURSED.AMOUNT | 0.361 |
| PRI.SANCTIONED.AMOUNT | 0.358 |
| PRIMARY.INSTAL.AMT | 0.353 |
| CREDIT.HISTORY.LENGTH | 0.326 |
| CHL | 0.326 |
| DisbursalDate | 0.317 |
| NbrMonthRelation | 0.317 |
| supplier_id | 0.316 |
| PRI.CURRENT.BALANCE | 0.313 |
| disbursed_amount | 0.240 |
| PRI.OVERDUE.ACCTS | 0.234 |
| AVERAGE.ACCT.AGE | 0.222 |
| AAA | 0.222 |
| DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS | 0.199 |
| BorrowerAge | 0.194 |
| Current_pincode_ID | 0.181 |
| Date.of.Birth | 0.155 |
| Employee_code_ID | 0.150 |
| PRI.NO.OF.ACCTS | 0.120 |
| ltv | 0.112 |
| PRI.ACTIVE.ACCTS | 0.100 |
| State_ID | 0.083 |
| NO.OF_INQUIRIES | 0.055 |
| asset_cost | 0.048 |
| NEW.ACCTS.IN.LAST.SIX.MONTHS | 0.022 |
| manufacturer_id | 0.013 |
| Employment.Type | 0.007 |
| VoterID_flag | 0.006 |
| Aadhar_flag | 0.005 |
| Passport_flag | 0.004 |
| PAN_flag | 0.003 |
| Driving_flag | 0.003 |
| UniqueID | 0.000 |
| MobileNo_Avl_Flag | 0.000 |
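A common rule of thumb treats an information value below roughly 0.1 as weakly predictive. The table above can then be filtered to keep only the stronger drivers; a sketch on a few rows copied from the table (the 0.1 cutoff is a conventional assumption, not a project rule):

```r
# A few (variable, IV) pairs from the table above
iv_tbl <- data.frame(
  variable = c("PERFORM_CNS.SCORE", "CHL", "ltv", "asset_cost", "MobileNo_Avl_Flag"),
  info_value = c(0.573, 0.326, 0.112, 0.048, 0.000)
)

# Keep only the variables whose information value exceeds the 0.1 threshold
strong_vars <- iv_tbl$variable[iv_tbl$info_value > 0.1]
strong_vars  # PERFORM_CNS.SCORE, CHL, ltv
```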
# Note: `bins` here still holds the binning of the last variable from the loop
# above; uncomment the next line to bin every variable at once.
# bins = woebin(wrangled_train, y = 'loan_default')
data_woe = woebin_ply(wrangled_train, bins) %>%
as_tibble()
## [INFO] converting into woe values ...
data_woe
## # A tibble: 1,311 × 39
## UniqueID disbursed_amount asset_cost ltv branch_id supplier_id
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 576629 51578 68000 77.2 2 15899
## 2 624008 36975 66166 60.4 1 17916
## 3 564064 64482 73813 89.4 18 22708
## 4 628786 116908 174438 69.6 85 23002
## 5 545869 41381 71570 59.9 18 23483
## 6 515616 72717 115240 65.0 11 23150
## 7 631989 72299 101666 74.8 3 17038
## 8 568869 59388 71500 85 67 16694
## 9 644157 50278 74325 68.9 77 18238
## 10 567106 65882 89311 76.1 36 23697
## manufacturer_id Current_pincode_ID Date.of.Birth Employment.Type
## <fct> <fct> <date> <chr>
## 1 51 1720 1982-06-25 Salaried
## 2 86 4938 1966-08-20 Self employed
## 3 86 2697 1995-10-25 Salaried
## 4 67 1935 1981-06-01 Self employed
## 5 51 2721 1985-07-23 Self employed
## 6 48 5956 1988-11-09 Self employed
## 7 51 571 1997-01-01 Salaried
## 8 86 1482 1960-04-15 Self employed
## 9 120 2343 1983-12-22 Salaried
## 10 120 6619 1976-01-01 Salaried
## DisbursalDate State_ID Employee_code_ID MobileNo_Avl_Flag Aadhar_flag
## <date> <fct> <dbl> <fct> <dbl>
## 1 2018-10-13 4 1635 1 1
## 2 2018-10-26 3 1101 1 1
## 3 2018-10-09 4 1709 1 1
## 4 2018-10-27 4 983 1 1
## 5 2018-09-28 4 451 1 1
## 6 2018-09-18 3 3 1 1
## 7 2018-10-27 15 624 1 1
## 8 2018-10-11 6 169 1 1
## 9 2018-10-30 4 2197 1 1
## 10 2018-10-10 13 1335 1 0
## PAN_flag VoterID_flag Driving_flag Passport_flag PERFORM_CNS.SCORE
## <dbl> <fct> <fct> <fct> <dbl>
## 1 0 0 0 0 783
## 2 0 0 0 0 611
## 3 0 0 0 0 821
## 4 1 0 0 0 14
## 5 0 0 0 0 738
## 6 0 0 0 0 15
## 7 1 0 0 0 0
## 8 0 0 0 0 362
## 9 0 0 0 0 587
## 10 0 1 0 0 0
## PERFORM_CNS.SCORE.DESCRIPTION PRI.NO.OF.ACCTS PRI.ACTIVE.ACCTS
## <chr> <dbl> <dbl>
## 1 B-Very Low Risk 1 1
## 2 H-Medium Risk 24 5
## 3 A-Very Low Risk 4 2
## 4 Not Scored: Only a Guarantor 1 0
## 5 C-Very Low Risk 7 4
## 6 Not Scored: Sufficient History Not Available 1 1
## 7 No Bureau History Available 0 0
## 8 K-High Risk 3 3
## 9 I-Medium Risk 5 5
## 10 No Bureau History Available 0 0
## PRI.OVERDUE.ACCTS PRI.CURRENT.BALANCE PRI.SANCTIONED.AMOUNT
## <dbl> <dbl> <dbl>
## 1 0 771804 1600000
## 2 1 280921 425111
## 3 0 2302 39906
## 4 0 0 0
## 5 0 53462 76536
## 6 0 11076 11076
## 7 0 0 0
## 8 1 66443 77150
## 9 2 709398 772628
## 10 0 0 0
## PRI.DISBURSED.AMOUNT PRIMARY.INSTAL.AMT NEW.ACCTS.IN.LAST.SIX.MONTHS
## <dbl> <dbl> <dbl>
## 1 1600000 15707 0
## 2 425111 4014 0
## 3 9163 1399 0
## 4 0 0 0
## 5 76536 8538 1
## 6 11076 1496 1
## 7 0 0 0
## 8 77150 6884 3
## 9 773389 8519 1
## 10 0 0 0
## AVERAGE.ACCT.AGE CREDIT.HISTORY.LENGTH NO.OF_INQUIRIES loan_default
## <chr> <chr> <dbl> <fct>
## 1 4yrs 7mon 4yrs 7mon 0 0
## 2 1yrs 3mon 3yrs 0mon 0 0
## 3 1yrs 1mon 2yrs 6mon 0 0
## 4 3yrs 10mon 3yrs 10mon 0 0
## 5 0yrs 11mon 1yrs 8mon 1 1
## 6 0yrs 0mon 0yrs 0mon 1 1
## 7 0yrs 0mon 0yrs 0mon 0 1
## 8 0yrs 0mon 0yrs 2mon 0 1
## 9 2yrs 3mon 5yrs 9mon 0 0
## 10 0yrs 0mon 0yrs 0mon 0 1
## credit_category AAA CHL BorrowerAge NbrMonthRelation
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Low 55 55 39 38.5
## 2 Medium 15 36 55 38.1
## 3 Low 13 30 26 38.6
## 4 Not Scored 46 46 40 38.0
## 5 Low 11 20 36 39
## 6 Not Scored 0 0 33 39.3
## 7 Not Scored 0 0 24 38.0
## 8 High 0 2 61 38.6
## 9 Medium 27 69 37 37.9
## 10 Not Scored 0 0 45 38.6
## DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS_woe
## <dbl>
## 1 -0.173
## 2 -0.173
## 3 -0.173
## 4 -0.173
## 5 -0.173
## 6 -0.173
## 7 -0.173
## 8 1.09
## 9 1.09
## 10 -0.173
## # … with 1,301 more rows
4.3.4 Intra-model correlations (optional step)
The correlation matrix obtained with Cramér's V tells us how each variable correlates with loan_default. We first computed the correlation of every variable in our dataset. After a first review of the results, we decided to discard some variables that were not sufficiently correlated with loan_default. We then computed Cramér's V again on the selected variables and obtained new correlation values. Three variables show good correlation with loan_default: ltv_bin, NbrMonthRelation_bin and PERFORM_CNS.SCORE.DESCRIPTION_bin.
However, the correlation of the ltv variable with loan_default, as well as with the other variables, seems to be erroneous despite the meaning this variable carries for our study.
We can now fit our logistic regression with these variables.
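The logistic regression can be fitted with base R's glm; a minimal sketch on simulated toy data (the real model would use the selected columns of wrangled_train):

```r
# Minimal logistic regression sketch: default probability as a function of ltv
set.seed(101)
toy <- data.frame(ltv = runif(200, 26, 95))
toy$loan_default <- rbinom(200, 1, plogis((toy$ltv - 90) / 10))

model <- glm(loan_default ~ ltv, data = toy, family = binomial)

# Predicted default probabilities for two new applicants
predict(model, newdata = data.frame(ltv = c(40, 90)), type = "response")
```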
#
# ALL_COLUMN_NAME <- wrangled_train %>% colnames()
#
# ## Function to compute Cramér's V
# fCramerFunction = function(x,y) {
# tbl = wrangled_train %>% select(x,y) %>% table()
# cramV = round(cramer.v(tbl), 2)
# data.frame(x, y, cramV) }
#
# # create unique combinations of column names
# # sorting will help getting a better plot (upper triangular)
# df_comb = data.frame(t(combn(sort(c(ALL_COLUMN_NAME,"loan_default")), 2)), stringsAsFactors = F)
#
# # apply function to each variable combination
# df_res = purrr::map2_df(df_comb$X1, df_comb$X2, fCramerFunction)
#
# df_res
#
# # plot results
# df_res %>%
# ggplot(aes(x,y,fill=cramV))+
# geom_tile()+
# geom_text(aes(x,y,label=cramV))+
# scale_fill_gradient(low="white", high="red")+
# theme_classic()+ theme(axis.text.x = element_text(angle = 60, hjust = 1))
#############################################
SELECTED_COLUMNS <- c("State_ID",
"PERFORM_CNS.SCORE.DESCRIPTION",
"PAN_flag",
"NbrMonthRelation",
"ltv",
"loan_default",
"Employment.Type",
"BorrowerAge",
"Aadhar_flag")
small_df <- wrangled_train %>% select(
  all_of(SELECTED_COLUMNS)
)
fCramerFunction = function(x, y) {
  tbl = small_df %>% select(all_of(c(x, y))) %>% table()
  cramV = round(cramer.v(tbl), 2)
  data.frame(x, y, cramV) }
# create unique combinations of column names; loan_default is already in
# SELECTED_COLUMNS, so deduplicate before building the pairs
df_comb = data.frame(t(combn(sort(unique(c(SELECTED_COLUMNS, "loan_default"))), 2)), stringsAsFactors = F)
# apply function to each variable combination
df_res = purrr::map2_df(df_comb$X1, df_comb$X2, fCramerFunction)
df_res
##                                x                             y cramV
## 1                    Aadhar_flag                   BorrowerAge  0.17
## 2                    Aadhar_flag               Employment.Type  0.05
## 3                    Aadhar_flag                  loan_default  0.03
## 4                    Aadhar_flag                           ltv  0.90
## 5                    Aadhar_flag              NbrMonthRelation  0.24
## 6                    Aadhar_flag                      PAN_flag  0.18
## 7                    Aadhar_flag PERFORM_CNS.SCORE.DESCRIPTION  0.12
## 8                    Aadhar_flag                      State_ID  0.58
## 9                    BorrowerAge               Employment.Type  0.56
## 10                   BorrowerAge                  loan_default  0.21
## 11                   BorrowerAge                           ltv  0.87
## 12                   BorrowerAge              NbrMonthRelation  0.26
## 13                   BorrowerAge                      PAN_flag  0.21
## 14                   BorrowerAge PERFORM_CNS.SCORE.DESCRIPTION  0.20
## 15                   BorrowerAge                      State_ID  0.18
## 16               Employment.Type                  loan_default  0.04
## 17               Employment.Type                           ltv  0.88
## 18               Employment.Type              NbrMonthRelation  0.24
## 19               Employment.Type                      PAN_flag  0.04
## 20               Employment.Type PERFORM_CNS.SCORE.DESCRIPTION  0.13
## 21               Employment.Type                      State_ID  0.26
## 22                  loan_default                           ltv  0.90
## 23                  loan_default              NbrMonthRelation  0.25
## 24                  loan_default                      PAN_flag  0.03
## 25                  loan_default PERFORM_CNS.SCORE.DESCRIPTION  0.33
## 26                  loan_default                      State_ID  0.14
## 27                           ltv              NbrMonthRelation  0.88
## 28                           ltv                      PAN_flag  0.91
## 29                           ltv PERFORM_CNS.SCORE.DESCRIPTION  0.90
## 30                           ltv                      State_ID  0.90
## 31              NbrMonthRelation                      PAN_flag  0.24
## 32              NbrMonthRelation PERFORM_CNS.SCORE.DESCRIPTION  0.30
## 33              NbrMonthRelation                      State_ID  0.26
## 34                      PAN_flag PERFORM_CNS.SCORE.DESCRIPTION  0.12
## 35                      PAN_flag                      State_ID  0.34
## 36 PERFORM_CNS.SCORE.DESCRIPTION                      State_ID  0.12
# plot results
df_res %>%
ggplot(aes(x,y,fill=cramV))+
geom_tile()+
geom_text(aes(x,y,label=cramV))+
scale_fill_gradient(low="white", high="red")+
theme_classic()+ theme(axis.text.x = element_text(angle = 60, hjust = 1))
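To make the selection step explicit, the pairs involving loan_default can be pulled out and ranked. The snippet below sketches this on a small tibble whose values were copied by hand from the df_res output above (an illustrative helper, not part of the original pipeline):

```r
library(dplyr)

# Cramér's V values for pairs involving loan_default
# (copied from the printed df_res output)
assoc <- tibble::tribble(
  ~variable,                       ~cramV,
  "ltv",                            0.90,
  "PERFORM_CNS.SCORE.DESCRIPTION",  0.33,
  "NbrMonthRelation",               0.25,
  "State_ID",                       0.14,
  "PAN_flag",                       0.03
)

# Rank variables by association with the target
ranked <- assoc %>% arrange(desc(cramV))
ranked
```

The three top-ranked variables are the ones retained for the regression below (keeping in mind the caveat about the inflated ltv values).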
4.4 Logistic regression
Finally, we selected our variables by combining their discriminatory power, measured with the information value, and their association with our target loan_default. Based on these results, we kept three variables for the regression: ltv, PERFORM_CNS.SCORE.DESCRIPTION, and NbrMonthRelation. Beyond their discriminatory power and association scores, these variables are also meaningful for our use case.
For our logistic regression, we obtain the ROC curve below with an AUC = 0.6959. This corresponds to a fair, acceptable level of discrimination for a first scoring model.
feature_vars = c("ltv", "NbrMonthRelation", "PERFORM_CNS.SCORE.DESCRIPTION")
glm_formulation_vars <- reformulate(termlabels = feature_vars,
response="loan_default")
logit = glm(glm_formulation_vars,
data=wrangled_train,
family=binomial(link="logit"))
wrangled_test$Score = predict(logit,
newdata = wrangled_test,
type="response")
roc_obj = roc(wrangled_test$loan_default ~ wrangled_test$Score,
plot=TRUE, percent=TRUE, auc=TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_obj$auc
## Area under the curve: 69.59%
ggplot(wrangled_test, aes(x=Score, color=loan_default)) + geom_density()
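As a possible follow-up (a sketch assuming the wrangled_test and roc_obj objects from the chunk above, with an untuned 0.5 cutoff), the predicted probabilities can be turned into hard classes and cross-tabulated against the observed defaults:

```r
# Classify with an (assumed, untuned) 0.5 probability cutoff
pred_class <- ifelse(wrangled_test$Score > 0.5, 1, 0)

# Confusion table: predicted classes vs. observed defaults
table(Predicted = pred_class, Observed = wrangled_test$loan_default)

# The cutoff could instead be read off the ROC curve with pROC, e.g.:
# coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
```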